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ABSTRACT 

This  report  describes  the  parallel  implementation  of  the  Hough  Transform,  a  tech¬ 
nique  to  detect  collinear  edge  points.  Specifically,  we  consider  two  contrasting  architec¬ 
tures,  the  Butterfly  Parallel  Processor1,  essentially  a  shared  memory  machine,  and  the 
NCUBE2,  a  direct  connection  machine  in  which  processors  are  interconnected  in  the 
form  of  a  hypercube. 

Developing  parallel  Hough  transform  algorithms  involves  addressing  questions  of 
optimal  processor  allocation  and  parallel  “peak”  selection  in  image  neighborhoods.  We 
present  fast  practical  algorithms  (subject  to  inherent  lower  bounds)  and  discuss  the 
relevant  complexity  issues. 
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1  Introduction 


This  paper  discusses  parallel  algorithms  and  their  implementation  for  the  class  of  opera¬ 
tions  that  involve  computing  two-dimensional  histograms  of  image  features.  An  important 
class  of  such  operations  are  the  so-called  Hough  Transforms  [DH72],  which  can  be  used 
to  detect  straight  lines  or  edges  (or  more  generally,  arbitrarily  shaped  point  patterns) 
in  an  image.  This  problem  is  representative  of  computationally  intensive  intermediate 
level  vision  problems  and  is  one  that  merits  using  a  machine  whose  architecture  is  non 
Von-Neumann. 

This  paper  is  concerned  with  two  machines  of  very  different  characteristics.  The  first  is 
the  Butterfly  Parallel  Processor,  a  tightly-coupled,  shared  memory  machine.  Our  programs 
were  implemented  on  a  sixteen  node  machine  at  the  Center  for  Automation  Research.  The 
other  machine  is  a  sixty-four  node  NCUBE  machine  available  at  the  University  of  Mary¬ 
land’s  Center  for  Advanced  Research  in  Biotechnology.  The  processors  in  this  machine 
have  some  local  memory  and  are  located  at  the  corners  of  a  hypercube. 

The  main  issue  in  parallel  processing  is  the  efficient  use  of  processors.  Let  P  be  the 
number  of  processors  and  let  N  be  a  measure  of  problem  size.  Since  physical  multiproces¬ 
sors  will  be  built  with  a  finite  number  of  processors,  it  may  well  be  that  the  user  pushes 
the  limits  of  the  machine  by  taking  on  a  very  large  problem.  (It  is  also  not  hard  to  vi¬ 
sualize  problems  broken  into  modules  and  cases  where  the  multiprocessor  is  essentially 
partitioned.)  Here  we  have  a  case  where  P  <  N.  It  is  also  important  to  consider  the  case 
when  P  >  N  since,  as  Arvind  [AI86]  points  out,  one  wants  processors  to  be  scalable  in 
such  a  manner  that  adding  hardware  resources  results  in  increased  performance;  one  wants 
to  analyze  the  asymptotics  of  this  behavior.  Accordingly,  it  is  essential  that  we  discuss 
independently  how  the  algorithm  and  the  architecture  behave  as  the  number  of  processors 
and  the  problem  size  vary. 


In  all  the  cases,  the  goal  is  that  the  time  to  solve  the  problem  sequentially  be  P  times 
the  time  to  solve  the  problem  in  parallel,  but  it  is  not  clear  whether  this  can  be  achieved 
for  all  input  sizes.  If  we  can  achieve  this,  the  parallel  algorithm  is  optimal.  Let  Ti(N) 
represent  the  time  complexity  of  the  fastest  algorithm  on  a  problem  of  size  N  using  a 
single  processor.  Likewise,  let  Tp(JV)  be  the  complexity  of  a  parallel  algorithm  using  P 
processors.  Given  any  parallel  algorithm  A  for  a  given  problem  along  with  a  particular 
model  of  computation,  then  there  is  a  processor  efficiency  function(PEF)  S(N)  such  that 
for  1  <  P  <  S(N) 

TP(N)=0(T1(N)/P). 

Given  two  algorithms  Ai  and  A2,  and  the  corresponding  PEFs  Sy(N)  and  S2(N),  we 
say  that  Ay  is  more  processor  efficient  than  A2  if  there  exists  Ny  such  that  for  all  N  >  Ny, 
Sy(N )  >  S2(N)1.  A  subgoal  in  algorithm  design  thus  is  to  maximize  the  PEF  and  one 
characterizes  an  algorithm  A  by  saying  that  it  is  in  the  Class  KS(n)- 

Alternatively,  one  can  consider  for  any  P,  the  minimal  input  size  N  for  which  the 
algorithm  is  optimal  and  define  an  analogous  data  efficiency  function  (DEF).  Thus,  given 
an  optimal  algorithm  B,  there  exists  a  function  D(P)  such  that  for  all  N  >  D(P), 

TP(N)  =  0(Ty(N)/P) 

The  algorithm  B  is  then  said  to  be  in  the  Class  Jd(P)  and  the  subgoal  is  to  minimize  D(P). 
Given  two  algorithms  By  and  B2  and  the  corresponding  DEFs  Dy(P)  and  D2(P ),  we  say 
that  By  is  more  data  efficient  than  B2  if  Dy(P)  =  o(D2(P))  2.  Section  4.1.1  illustrates 
these  concepts. 

The  remainder  of  this  paper  is  organized  as  follows.  Section  2  defines  the  specific  type 
of  Hough  transform  that  we  are  interested  in  implementing  and  describes  previous  work 

*In  other  words,  Si(W)  =  Q(S2(A)). 

Given  functions  f  end  q  of  x,  f  —  o( j)  if  lim*  — * oo  <//?)=  o. 


done  in  this  area.  Section  3  describes  the  characteristics  of  the  parallel  machines  we  are 
considering.  Section  4  presents  the  algorithms  and  discusses  the  relevant  complexity  issues 
pertaining  to  them  as  well  as  some  useful  implementation  strategies.  Section  5  contains 
some  concluding  remarks. 


2  Definition  of  the  Problem 


A  common  problem  in  computer  image  processing  is  the  detection  of  straight  lines  in 
digitized  images.  In  the  simplest  case,  the  image  contains  a  number  of  discrete  feature 
points  perhaps  corresponding  to  the  edge  pixels  obtained  by  a  local  operator  (see  below). 
The  problem  is  to  detect  the  presence  of  groups  of  collinear  or  almost  collinear  feature 
points.  It  is  clear  that  the  problem  can  be  solved  to  any  desired  degree  of  accuracy  by 
testing  the  lines  formed  by  all  pairs  of  points.  However,  the  computation  required  for  n 
points  is  approximately  proportional  to  n2,  and  may  be  prohibitive  for  large  n. 

Rosenfeld  [Ros69]  describes  a  method  originally  proposed  by  Hough  [Hou62]  and  sub¬ 
sequently  improved  upon  by  Duda  and  Hart  [DH72]  to  solve  the  problem.  Essentially,  a 
Hough  transform  is  designed  to  detect  collinear  sets  of  edge  pixels  in  an  image  by  mapping 
these  pixels  into  a  parameter  space  (the  Hough  space)  defined  in  such  a  way  that  collinear 
sets  of  pixels  in  the  image  give  rise  to  peaks  in  the  Hough  space. 

A  brief  discussion  of  the  method  follows.  The  edge  pixels  processed  by  the  transform  are 
obtained  by  applying,  for  example,  a  Sobel  gradient  operator  to  the  image.  Such  operators 
compute  digital  approximations  to  the  gray  level  gradient  magnitude  and  direction  at  each 
(non-border)  pixel  of  the  image. 

Now,  given  a  set  of  collinear  edge  points  {(xi,yi),  (x2,y2), . . . ,  (xn,yn)},  we  know  that 
they  all  must  satisfy  the  normal-angle  representation  of  a  line: 


p  =  Xi  x  cos  9  +  Vi  x  sin  9 
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The  key  observation  is  that  points  lying  on  the  same  straight  line  m  the  picture  plane 
correspond  to  curves  through  a  common  point  in  the  p,0  parameter  plane.  Thus,  the 

problem  of  finding  the  set  of  lines  in  the  image  plane  is  reduced  to  that  of  finding  common 

\ 

points  of  intersection  of  sinusoidal  curves  in  the  parameter  plane.  .. 

The  implementation  of  the  Hough  transform  for  detecting  straight  lines  on  a  sequential 
machine  involves  a  quantization  of  the  parameter  plane  into  a  quadrupled  grid.  The  grid 
size  is  determined  by  the  acceptable  errors  in  the  parameter  values  and  the  quantization  is 
confined  to  a  specific  region  of  the  parameter  plane  determined  by  the  range  of  parameter 
values.  A  two-dimensional  array  (the  accumulator  array  or  the  Hough  array)  is  then 
used  to  represent  the  parameter  plane  grid,  where  each  array  entry  corresponds  to  a 
grid  cell.  For  each  edge  pixel,  the  algorithm  increments  the  counts  in  all  accumulator 
array  entries  that  correspond  to  lines  passing  through  that  edge  pixel.  The  process  of 
incrementing  accumulator  array  counts  can  be  thought  of  as  “voting”  by  the  edge  pixels 
for  the  parameter  values  of  possible  lines  passing  through  these  points.  The  time  required 
to  execute  this  algorithm  is  proportional  to  the  size  s  of  the  grid,  plus  the  number  m  of 
edge  pixels  times  the  number  of  votes  v  cast  by  each  point  i.e.  O  (3  +  mv).  The  memory 
space  is  proportional  to  the  size  of  the  grid. 

Once  the  Hough  array  is  computed,  we  seek  to  find  those  array  elements  which  have 
high  values.  (The  number  of  these  depends  upon  the  constraints  posed  by  the  problem,  but 
it  is  clear  that  we  are  going  to  seek  a  subset  of  the  total  available  entries.)  However,  it  is  not 
just  enough  to  pick  the  first  few  high  values  because,  depending  upon  the  quantization,  the 
same  edge  pixel  will  contribute  to  a  “thick”  curve,  i.e.  votes  to  entries  that  are  adjacent. 
More  importantly,  because  of  the  inherent  non-linearity  of  the  transformation,  a  “real” 
cell  will  be  surrounded  by  “false”  votes  in  its  neighborhood.  These  problems  are  solved  by 
searching  for  the  local  maximum  in  an  /  x  I  square  neighborhood,  perhaps  after  smoothing 
the  Hough  array. 
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The  Hough  transform  has  an  extensive  literature.  In  the  sequential  case,  Hough  origi¬ 
nally  proposed  a  slope-intercept  parameterization.  Duda  suggested  the  normal-angle  rep¬ 
resentation  of  a  straight  line  in  order  to  bound  the  values  of  the  parameters.  The  technique 
has  been  extended  by  Kimme  et  al.  [KBS75]  to  detect  other  curves  defined  analytically, 
and  by  Ballard  [Bal8l]  to  detect  general  curve  shapes  using  the  edge  orientations  at  the 
image  points.  Yet  another  variation  which  avoids  the  use  of  transcendental  functions  is 
presented  in  [Wal85j.  Brown  [Bro84]  discusses  a  memory  efficient  implementation  of  the 
Hough  transform. 

Continued  interest  has  also  been  shown  in  the  parallel  version.  Silberberg  [Sil85]  de¬ 
scribes  a  Hough  algorithm  on  a  mesh-connected  computer.  This  has  been  subsequently 
improved  upon  by  Rosenfeld  et  al.  [ROH86].  They  consider  multiprocessors  executing 
instructions  in  a  Single  Instruction  Multiple  Data  (SIMD)  [Fly72]  fashion  and  where  the 
amount  of  local  memory  available  is  critical.  Merlin  et  al.  [Mer75]  have  a  method  for  de¬ 
tecting  a  general  curve  in  parallel  and  Ibrahim  et  al.  [IKS85]  have  discussed  a  simulation 
on  the  NON-VON  [Rab86]  architecture;  they  present  two  algorithms  based  on  an  SIMD 
and  an  MSIMD  approach  on  a  tree-like  machine.  Sher  [She85]  has  described  an  implemen¬ 
tation  of  the  general  template  matching  problem  on  the  WARP  systolic  multiprocessor 
and  on  a  Butterfly.  His  implementation  is  very  different  from  the  one  described  here,  as 
discussed  later. 

In  the  simplest  case,  the  steps  involved  in  detecting  large  collinear  sets  of  edge  or  feature 
points  in  a  binary  picture  may  be  summarized  as  follows:  Let  0mtn  to  0mKX  be  the  range  of 
angles  that  we  are  interested  in  and  let  Accum  refer  to  the  accumulator  array.  Then  the 
simple  algorithm  is  : 


.'O 
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p  =  \x  x  cos  6  +  y  x  sin  0  ] 
Accum  [p,  0)  =  Accum  [p,  0]  +  1 


Once  the  accumulator  array  is  computed,  we  must  next  find  the  k  highest  local  maxima. 
One  can  then  construct  another  array  which  has  as  its  entries  all  those  elements  which 
are  the  local  maxima  in  all  3  x  3  neighborhoods.  Next  we  could  find  the  first  k  values 
(for  instance  by  sorting  the  newly  constructed  array).  In  the  sequential  case,  there  is  an 
obvious  better  algorithm  which  skips  the  intermediate  step  and  directly  finds  the  first  k 
peaks.  This  is  done  using  an  array  of  k  items  (called  final-list)  which  stores  the  angle, 
distance  and  Hough  entry  values.  The  corresponding  simple  algorithm  that  suggests  itself 


for  each  (p,  0)  pair  entry 

if  local_maxima(entry)  then 
insert(angle,  distance,  value)  in  final-list. 
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It  is  not  obvious  if  this  algorithm  will  work  better  in  the  parallel  case  and  we  shall  have 
more  to  say  about  this  in  Section  4. 

3  Description  of  the  Machines 


$1 


In  this  section,  we  review  those  aspects  of  the  parallel  machines  that  are  relevant  to  the 
problem  at  hand.  For  more  information,  see  [The86,Chr86,NCU85]. 

The  NCUBE  is  a  “message-passing”  machine  configured  in  the  form  of  a  10-dimensional 
hypercube.  Advantages  of  this  topology  have  been  discussed  in  [Sei85,Pea77]  etc.  Each 
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of  the  1024  identical  custom  built  processors  has  “local”  memory  of  128K  bytes  and  is 
connected  to  10  other  processors.  The  processor  in  the  NCUBE  is  a  general  purpose  32 
bit  processor  and  includes  high  speed  32  and  64  bit  IEEE  floating  point. 

It  is  important  to  view  the  NCUBE  in  relation  to  a  host  Intel  286  minicomputer  because 
code  may  be  loaded  onto  the  processors  (interchangeably  called  “nodes”)  only  from  the 
host.  Parallelism  in  the  NCUBE  is  very  natural  after  the  host  ships  out  .(perhaps  different) 
programs  to  the  different  processors.  They  start  working  on  the  data  (provided  by  the 
host  or  some  other  node)  and  synchronization  is  achieved  by  sending  messages.  At  the 
current  stage,  the  main  mechanism  for  programming  is  a  modified  FORTRAN  language. 
Parallelism  will  be  reduced  whenever  the  processors  are  idle,  perhaps  due  to  improper  load 
balancing,  and  when  messages  are  received,  sent  or  forwarded  by  the  processors. 

The  Butterfly  is  a  tightly  coupled  “shared-memory”  architecture  machine  with  at  most 
256  MC68000  processors.  Each  processor  sits  on  a  card  with  a  process  node  controller 
(PNC)  and  a  megabyte  of  memory  (which  in  some  sense  is  local  to  this  processor).  Pro¬ 
cessors  are  interconnected  in  a  “butterfly”  [KS84]  fashion,  thus  enabling  processors  to  gain 
quick  access  to  memory  located  physically  on  a  different  card.  There  is  a  single  clock  in  the 
whole  machine  but  asynchronous  multiple  instructions  may  be  executed  by  the  different 
processors  on  multiple  data. 

The  standard  approach  to  programming  the  Butterfly  is  the  so-called  Uniform  System 
(Cro86j  wherein  computational  tasks  that  comprise  an  application  are  emphasized  and 
the  notion  of  processes  is  less  relevant.  The  Uniform  system  views  the  conglomeration  of 
memory  as  one  huge  shared  memory  seen  in  the  address  space  of  all  identical  processors 
and  seeks  to  scatter  application  data  to  reduce  memory  contention.  Parallelism  is  exploited 
by  generating  tasks  using  a  “generator"  (similar  to  the  map  function  in  LISP).  This  is 


typically  visualized  as  a  procedure  which  has  as  its  arguments  a  description  of  the  data  and 
a  worker  function  to  analyze  the  data;  the  generator  results  in  activating  processors  with 


different  subsets  of  the  data.  The  form  of  this  subset  is  generally  limited  in  the  current 
release  (version  2.2.3)  and  typically  is  a  row  of  a  matrix. 

Parallelism  in  the  Butterfly  suffers  because  of  memory  contention,  switch  contention 
and  overhead  involved  in  generating  the  tasks.  Unfortunately,  in  spite  of  the  existence 
of  a  scheduler,  it  is  not  possible  to  generate  more  active  processes  than  processors  in  the 
Uniform  system.  This  has  imposed  a  limitation  on  our  experiments.  Nor  is  it  possible  for 
processors  to  act  “independently”  since  the  same  code  is  loaded  on  all  the  processors. 

4  The  Algorithms 

We  present  our  parallel  algorithms  in  this  Section.  As  mentioned  eaxlier,  one  has  to 
compute  the  Hough  array  and  then  extract  peaks.  On  a  systolic  machine,  such  as  the 
Warp,  one  may  be  interested  in  detecting  the  peaks  even  while  the  Hough  array  is  being 
calculated,  but  on  our  machines  there  is  no  obvious  advantage  to  this  approach.  We 
construct  the  Hough  array  and  detect  the  peaks  in  a  sequential  manner  but  in  both  the 
steps  we  would  like  to  allocate  processors  optimally. 

4.1  Computing  the  Hough  array 

Let  flmjn  to  0max  determine  the  relevant  range  of  angles.  One  may  think  of  computing  the 
Hough  array  in  terms  of  mapping  a  three-dimensional  xxyxO  space  into  a  one-dimensional 
p  space  and  then  incrementing  the  appropriate  (p,  6)  cells  by  1. 

At  this  stage,  it  is  not  obvious  how  the  processors  should  be  allocated.  For  instance, 
we  can  identify  the  following  options: 

•  Distribute  processing  by  z:  Assign  one  row  of  the  binary  image  to  each  processor. 
It  then  computes  p  over  the  entire  range  of  angles.  (One  could  conceivably  use  a 
column  instead  of  a  row.) 


•  Distribute  processing  by  region:  Assign  a  square  shaped  region  of  the  input  binary 
image  to  a  processor.  It  then  computes  p  over  the  entire  range  of  angles. 

•  Distribute  processing  by  6:  Assign  one  angle  to  each  processor.  Jt  then  computes  p 
for  the  entire  binary  image. 

•  Distribute  processing  by  6:  As  above,  assign  a  processor  to  an  angle.  However,  the 
processor  analyzes  only  a  part  of  the  input  image. 

As  an  example,  consider  scaling  the  problem  down  to  an  input  picture  of  size  1x2, 
an  angle  spread  of  5°  and  an  angular  quantization  of  1°  (an  assumption  we  will  make 
throughout,  for  simplicity).  Assuming  that  we  have  only  two  processors,  we  immediately 
see  that  it  is  preferable  to  distribute  processing  by  0.  The  only  thing  that  we  can  hope 
for  is  that  the  input  image  is  random.  If  we  allocate  one  processor  to  each  of  the  pixels, 
then  one  processor  may  end  up  doing  no  work  at  all  since  it  may  not  be  the  case  that 
both  the  pixels  are  edge  pixels.  On  the  other  hand,  distributing  by  0  implies  that  both 
processors  are  working  at  all  times  and  will,  in  general,  lead  to  better  processor  allocation. 
This  corresponds  to  the  case  of  “Distribute  processing  by  0 ”  above  and  will  be  the  pivotal 
idea  on  which  we  shall  base  the  allocation  upon. 

We  present  an  allocation  scheme  for  the  general  case  (other  schemes  are  also  possible). 
Let  ^  =  0mMX  —  0min  +  1  and  let  N  be  the  number  of  rows  in  the  input  image.  We  will 
consider  the  following  three  possible  cases: 


p  <  * 

(1) 

V  <  P  <  VN 

(2) 

*N  <P 

(3) 
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In  the  first  case,  the  ith  processor  computes  the  Hough  array  for  a  set  of  0s;  specifically, 
^min  "t"  **^min  "H  ®  "h  Pi  •  •  •  »  0min  +  *  +  [^(modu/oP)]  3  over  the  entire  binary  image. 

In  the  second  case,  the  unit  of  processing  given  to  a  processor  is  a  row  of  the  input 
image  and  an  angle.  For  each  edge  point  in  the  row,  the  processor  computes  (or  looks  up 
in  a  table)  the  appropriate  value  of  p.  Each  processor  can  be  given  either  overlapping  sets 
of  rows  or  “scattered”  sets  (our  implementation). 

Finally,  in  the  last  case,  the  unit  of  processing  is  now  an  angle  and  part  of  the  input 
row.  Thus,  for  instance,  if  P  =  2 then  processors  will  be  working  on  an  angle  and  half 
a  row  of  the  input  image. 

4.1.1  Computational  complexity  for  the  Hough  array  computation 

The  analysis  for  this  part  is  easy;  it  illustrates  the  relation  between  the  problem  size  and 
the  efficiency  of  the  parallel  computation. 

Once  the  input  space  is  allocated  to  the  different  processors  as  described  above,  each 
processor  independently  computes  a  “version”  of  the  Hough  array  based  on  the  subset  of 
the  problem  given  to  it.  The  Hough  arrays  are  then  collapsed  into  the  complete  Hough 
array  using  the  standard  [PV81]  collapsing  technique.  Thus,  on  the  hypercube,  one  could 
use  the  following  procedure,  Algorithm  Merge,  to  merge  the  Hough  arrays  (here,  mynodeid 
is  a  unique  address  for  a  hypercube  node): 

for  index  =  1  to  log2  P 
begin 

if  mynodeid  <  2tndez  then 

neighbor  =  mynodeid  xor  2,ndez~1 
if  neighbor  <  mynodeid  then 

send  (myhougharray,  neighbor) 

3  A  different  partioning  where  each  processor  analyses  contiguous  angles  is  possible. 


begin 

receive  (hishougharray,  neighbor) 
myhougharray  =  myhougharray  +  hishougharray 
end 

end 

Let  Hsize  represent  the  size  of  the  Hough  array.  For  the  purpose  of  analysis  we  assume 
that  every  pixel  in  the  input  N  x  N  image  is  an  edge  pixel  in  the  worst  case  and  that 
Hsize  is  asymptotically  smaller  than  the  input  N.  Then,  in  this  model  of  computation, 


TP(N)  =  0(N*/P)  +  H-size  log  P  (4) 

which  we  shall  approximate  as 

TP(N)  =  0(N1/P).  (5) 

Recall  that 

Ti(N)  =  0(N2).  (6) 

Munro  and  Patterson  [MP71]  have  a  lower  bound  result  which  states  that 

Tp(N)  >  Ti(N)/P  +  logj (min(P, 2\ [N)))  -  1.  (7) 

If  we  substitute  equation  6  into  equation  7,  we  have 

TP[N)  >  0(N*/P)  +  log2(min (P,N2))  -  1  (8) 


and  thus  asymptotically,  this  result  is  optimal  within  the  constraints  of  our  assumption. 
(See  eqn.  5.) 

As  discussed  in  the  introduction,  it  is  useful  to  see  for  what  values  of  input  size  this 
algorithm  achieves  optimal  speedup.  Essentially  we  want  to  determine  when 


TP(N)  =  Tt(N)/P 


viewing  P  and  N  in  turn  as  the  independent  variable.  We  thus  need  to  solve  the  equation 


N2/P  +  H-size  logj  P  =  0(N2/P) 

The  solution  to  this  is 

N  =  Sl{H-sizeP  log  P)1'2. 

The  problem  size  should  be  at  least  (P  logP)1/2  for  the  algorithm  to.be  optimal  or  the 
DEF  is  (PlogP)1/2. 

Reversing  the  question  we  have,  for  optimality, 

P  =  N2/\ogN 


since  the  term 


Hsizt  log2  P 


evaluates  to 

2  H^size  log  (N/  log  N) 


or 

0(Iog  N) 

In  other  words,  the  algorithm  exploits  at  most  N 2  /  log  N  processors  to  solve  a  problem  of 
size  N  and  this  gives  an  idea  of  how  good  the  algorithm  is  (the  PEF  is  N2  /  logN).  To 
summarize,  the  algorithm  is  in  the  Class  J(plog/>)i/a  and  Class  K^/iogN- 


4.1.2  Implementation  details 

Recall  from  the  description  in  Section  3  that  on  the  Butterfly  we  do  not  have  the  ability 
to  increase  the  number  of  processes  beyond  the  actual  number  of  physical  processors,  i.e. 
one  cannot  have  virtual  processors.  This  limits  the  set  of  experiments  that  we  can  perform 
since  we  cannot  measure  the  speedup  beyond  16  processors. 


We  remarked  earlier  that  there  are  other  data  allocation  schemes  that  may  also  achieve 
similar  performance.  Our  scheme  has  the  additional  merit  that  on  the  Butterfly  it  is  easy  to 
implement.  When  the  Uniform  System  generator  produces  tasks  to  be  executed  in  parallel, 
it  produces,  conceptually,  an  identification  number  for  each  task  and  different  processors 
execute  tasks  corresponding  to  different  identification  numbers.  In  our  situation,  when 
we  allocated  angles  to  each  processor  (corresponding  to  Equation  1  in  Section  4.1),  these 
numbers  correspond  to  the  different  angles  since  now  the  unit  of  processing  is  an  angle.  As 
a  result,  the  P  processors  evaluate  the  first  P  tasks  in  parallel  and  then  process  the  next 
P  tasks.  (Each  processor  thereby  does  not  work  on  the  task  corresponding  to  contiguous 
angles.)  In  the  next  situation  (see  Equation  2  in  Section  4.1)  we  decided  to  allocate  rows  of 
the  input  image  for  a  similar  reason  and  thus  each  processor  does  not  work  on  contiguous 
rows  but  rather  on  a  scattered  set  of  rows.  It  also  helps  that  adjacent  elements  in  a  row 
of  the  matrix  are  also  physically  adjacent  even  though  adjacent  rows  are  scattered  since 
block  transfer  of  physically  adjacent  memory  cells  from  global  memory  to  the  local  memory 
of  the  processor  is  efficient  on  the  Butterfly.  Thus  there  is  no  contention  when  different 
processors  are  accessing  the  input  image  in  parallel. 

Once  the  individual  Hough  array  versions  are  computed  by  the  different  processors,  we 
need  to  merge  them  into  one  “correct”  Hough  array.  Communication  between  processors 
on  the  shared  memory  Butterfly  is  achieved  by  directly  accessing  global  memory,  so  in 
Algorithm  Merge,  we  need  not  explicitly  send  or  receive  the  individual  Hough  array  versions 
to  be  merged.  However,  the  “versions”  that  the  different  processors  compute  have  to  be 
allocated  in  global  memory  and  there  is  an  obvious  duplication  here;  there  are  bound  to 
be  situations  where  this  might  be  a  limitation  since,  in  particular,  the  current  Uniform 
System  forbids  piecewise  freeing  of  dynamic  memory. 

Our  implementation  defined  the  “final”  Hough  array  in  global  memory.  The  individual 
versions  are  never  explicitly  built  but  instead  the  copy  in  global  memory  is  updated.  It 


is  well  known  that  updating  in  global  memory  requires  that  the  information  be  added 
atomically;  so  we  use  the  “locking”  mechanisms  available  in  Chrysalis  to  ensure  integrity. 
Thus,  there  is  no  need  to  allocate  memory  in  the  global  address  space  for  the  individual 
versions. 

Undoubtedly  this  approach  reduces  efficiency;  however,  one  can  manipulate  the  pro¬ 
gram  variables  to  take  advantage  of  the  local  and  global  memory  aspects  of  the  shared 
memory  and  reduce  the  contentions.  Since  programming  on  the  Butterfly  is  done  in  C,  we 
often  need  to  follow  a  sequence  of  pointers  in  order  to  access  the  “locked”  data  structure.  A 
useful  strategy  in  such  cases  is  to  make  local  copies  of  the  data  thus  restricting  contention 
only  at  the  final  destination  memory  word.  In  fact,  for  the  Hough  transform  done  on  a 
window  of  an  image  for  the  Autonomous  Land  Vehicle,  we  observed  linear  speedup  for  a 
32  x  32  input  binary  image  with  approximately  twenty-five  percent  edge  pixels  when  the 
angle  spread  was  about  twenty  degrees. 

The  same  idea  may  be  used  even  in  the  processor  allocation  step.  In  addition,  accesses 
to  local  memory  cost  less  than  accesses  to  global  memory;  so  a  “block”  of  data  may 
be  transferred  to  local  memory  and  then  used  in  computations.  Again,  when  processors 
are  allocated  parts  of  the  input  matrix,  allocation  is  done  altematingly  from  the  top  and 
the  bottom.  This  results  in  less  contention  while  accessing  common  data,  for  instance, 
overlapping  rows. 

Though  the  algorithm  suited  the  NCUBE  very  well,  it  was  in  its  infancy  of  software 
development  when  the  algorithm  was  implemented.  For  instance,  the  time  spent  in  execu¬ 
tion  of  the  program  viz.  the  duration  between  sending  of  the  node  program  and  receiving 
the  information  back  to  the  host  (recall  the  discussion  in  Section  3)  cannot  be  easily  be 
measured.  Secondly,  it  is  possible  to  pick  messages  out  of  order  in  which  they  have  been 
sent,  i.e.  by  the  type  of  message.  Finally,  input/output  is  possible  only  from  the  host  and 
this  restricts  the  pace  of  debugging  programs. 
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4.2  Peak  detection 


More  interesting  theoretical  ideas  are  involved  in  searching  for  the  k  highest  peaks  once 
the  accumulator- array  is  available  and  is  in  fact  the  subject  of  another,  report  [CR86].  We 
propose  different  solutions  here  and  consider  their  relative  advantages. 

Using  a  single  processor,  the  problem  cannot  be  solved  faster  than  0(N )  because  the 
kth  largest  value  could  be  the  Nth  value  (here  N  refers  to  the  size  of  the  Hough  array 
and  not  the  image  size).  Recall  further  that  one  can  find  the  median  of  a  list  of  numbers 
(or  in  general,  the  kth  largest  element)  in  linear  time  [AHU74].  Once  we  do  this,  we  can 
look  at  one  element  at  a  time,  compare  it  to  the  kth  largest  element  and  thus  find  all  the 
desired  elements  in  linear  time.  Thus,  although  we  search  for  the  k  highest  peaks,  our 
problem  is  in  fact  no  more  difficult  than  the  median  finding  problem.  Our  goal  is  to  solve 
the  problem  in  constant  time  using  N  processors  and,  in  general,  to  solve  it  in  N/P  time 
using  P  processors. 

At  first,  one  is  inclined  to  sort  the  Hough  bins  using  the  accumulator  value  as  the  key. 
(Once  sorting  is  done,  it  is  trivial  to  pick  the  k  highest  peaks.)  Using  N  processors,  we 
cannot  sort  in  less  than  0{\og  N)  time  [AKS81J.  As  far  as  implementations  go,  though 
the  above  is  fast,  it  is  not  practical  because  the  constant  hidden  in  front  of  the  0(\ogN) 
description  is  very  large  (about  a  million).  The  value  of  k  in  most  applications  is  low  and 
the  following  algorithm  is  an  efficient  alternative  (though  asymptotically  not  faster  than 
sorting). 

Processor  allocation  is  done  by  splitting  (but  not  partitioning)  the  Hough  array  into 
P  strips.  When  successive  strips  share  a  common  row,  then  each  pixel  and  its  3  x  3 
neighborhood  is  contained  in  some  single  strip  (except  for  the  border  pixels).  Then  each 
processor  using  a  fast  sorting  algorithm  finds  the  first  k  elements  in  its  data.  We  then 
combine  the  individual  k  peaks  in  a  tree-like  fashion  to  produce  the  final  desired  k  peaks. 
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This  algorithm  may  be  easily  implemented  both  on  the  Butterfly  and  on  the  NCUBE. 
It  is  easy  to  observe  that 

TP(N)  =  0(N/P  log {N/P)  +  k  log  P), 

the  first  term  arising  from  the  sorting  part  and  the  second  from  the  rperging.  Asymptoti¬ 
cally,  this  is  not  a  good  algorithm  but  for  “reasonable"  values  of  k  and  N,  it  is  satisfactory. 
In  fact,  this  is  the  version  we  have  implemented. 

If  we  are  indeed  interested  in  asymptotics,  we  observe  that  we  can  use  the  linear  median 
finding  algorithm  instead  of  the  fast  sorting  algorithm  in  the  initial  stage  (i.e.  when  each 
processor  finds  the  k  highest  peaks  in  its  data).  The  merging  of  these  is  done  as  before. 
In  this  case, 

TP{N)  =  0(N/P  +  k  log  P).  (9) 

Once  again,  we  are  interested  in  when,  as  a  function  of  N  and  P,  this  algorithm  achieves 
a  speedup  of  P  over  the  sequential  linear  running  time.  If  the  problem  size  is  at  least 
fcPlogP  optimal  efficiency  is  achieved  (the  DEF  is  fcP  log  P)  4.  Once  again,  reversing  the 
question,  if  we  have  at  most  N/{k  log  JV)  processors,  then  substituting  into  equation  9,  we 
have 

Tp(N)  =  0(k\ogN  +  klog{N/k  log N)) 
i.e. 

Tp(N)  =  0(k  log  N) 

and  thus 

PTP{N)  =  O(N) 

the  uniprocessor  time.  The  problem  is  solved  optimally  and  to  summarize,  the  algorithm 
is  in  the  Glass  J(tpiogP)  &nd  Class  i£jv/(fciogjv)* 

4  Of  course,  here  we  ere  assuming  that  k  ie  not  e  parameter  to  the  problem;  [CR86J  describes  an  algorithm 
whose  running  time  is  independent  of  k. 
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5  Conclusions 


The  Hough  transform  is  a  technique  to  detect  line  segments  in  an  image  by  finding  global 
consistencies  in  the  data.  Since  it  considers  every  feature  point,  the  process  is  slow  on 
a  conventional  Von  Neumann  machine.  There  is,  however,  a  certain  independence  in  the 
Hough  array  computation  and  whenever  such  independence  exists  in  any  problem,  the  use 
of  machines  whose  architecture  is  not  Von  Neumann  is  justified.  Matrix  multiplication  is 
an  example  of  a  problem  wherein  all  the  computations  are  independent  and  is  an  excellent 
test  problem  for  a  parallel  machine.  An  appropriate  partitioning  of  the  data  is  required 
in  such  problems  and  in  this  paper,  we  have  identified  the  "independent”  component  in 
Section  4.1.1  and  shown  how  to  allocate  processors  on  the  NCUBE  and  the  Butterfly.  Once 
the  Hough  array  is  computed,  one  is  usually  interested  in  detecting  peaks  in  order  to  find 
the  best  line  segments.  We  have  shown  that  this  problem  is  no  more  difficult  than  the 
median  finding  problem  and  proceeded  to  obtain  an  optimal  parallel  selection  algorithm. 
Here  the  partitioning  of  data  is  less  critical  and  the  communication  costs  dominate  the 
efficiency  of  the  parallel  computation. 

An  orthogonal  issue  that  we  have  addressed  in  this  paper  is  the  relation  between  prob¬ 
lem  size  and  the  number  of  processors.  Given  a  machine  of  a  fixed  configuration,  it  is 
important  to  find  the  appropriate  problem  size  to  determine  the  efficacy  of  the  architec¬ 
ture;  in  most  problems,  if  the  problem  size  is  sufficiently  large,  one  is  bound  to  get  a  linear 
speedup  on  the  finite  number  of  processors  that  exist  on  the  machine.  We  are  interested  in 
finding  bounds  on  this  parameter.  Alternatively,  when  one  has  a  fixed  application,  and  the 
problem  size  is  constrained  by  other  issues,  one  is  interested  in  determining  a  reasonable 
number  of  processors  that  one  should  use  for  the  problem.  It  is  clear  that  not  all  the 
processors  may  be  useful  if,  for  instance,  the  problem  is  too  small.  By  introducing  the 
idea  of  processor  and  data  efficiency  functions,  we  have  illustrated  a  technique  to  obtain 


bounds  on  these  numbers. 
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